Replacement - Handling Failures in a Replicated State Machine
Abstract
State machine replication is a common approach to building fault-tolerant services. A Replicated State Machine (RSM) typically uses a consensus protocol such as Paxos [1] to decide on the order of updates and thus keep replicas consistent. With Paxos, the RSM can continue to process new requests as long as more than half of the replicas remain operational. If this bound is violated, however, the current RSM stops making progress indefinitely. To avoid scenarios in which the number of failures exceeds the bound, it is beneficial to start failure handling immediately, provided this can be done without significantly disrupting request execution. This can be done by reconfiguration, a general method for replacing one set of replicas with another. Classical reconfiguration relies on the RSM itself to decide on a reconfiguration command [2]. For this, the old configuration must have a majority of operational replicas and a single correct leader. The latter can only be guaranteed if the replicas are sufficiently synchronized. In this paper, we present Replacement [3], a reconfiguration algorithm specialized for replacing a faulty replica with a new one. Replacement also requires a majority of operational replicas. However, unlike traditional reconfiguration techniques, failure handling with Replacement does not rely on consensus. Thus, with Replacement, faulty replicas can be replaced even during periods of asynchrony, e.g., when clocks are not synchronized and the network experiences unpredictable delays, or when multiple replicas are competing for leadership. This is useful because replacing slow or overloaded replicas can restore synchrony, and replaced replicas can no longer compete for leadership. In [4] we showed that reconfiguration without consensus is possible. However, the algorithm presented in [4], ARec, has to stop the state machine during reconfiguration. Replacement, our new method, includes minor adjustments to the Paxos algorithm that allow the RSM to make progress while replicas disagree on the current configuration. It thus avoids the increased client latency and temporary unavailability caused by ARec.
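The majority bound described in the abstract can be made concrete with a little arithmetic. The following Python sketch only illustrates that bound and why replacing a faulty replica early restores the failure margin. It is not the Replacement algorithm itself, and all function names (`majority`, `can_make_progress`, `failure_margin`, `replace_faulty`) and replica names are hypothetical.

```python
from typing import Dict


def majority(n: int) -> int:
    """Smallest number of replicas that is more than half of n."""
    return n // 2 + 1


def can_make_progress(n: int, operational: int) -> bool:
    """True if a majority of the n configured replicas is operational."""
    return operational >= majority(n)


def failure_margin(config: Dict[str, bool]) -> int:
    """How many more replicas may fail before the majority is lost.

    `config` maps a (hypothetical) replica name to True if it is operational.
    A negative result means the majority is already lost.
    """
    operational = sum(1 for up in config.values() if up)
    return operational - majority(len(config))


def replace_faulty(config: Dict[str, bool], faulty: str, fresh: str) -> Dict[str, bool]:
    """Return a new configuration with `faulty` swapped for an operational `fresh`.

    This models only the outcome of a replacement, not the protocol that
    makes the replicas agree on it.
    """
    updated = {name: up for name, up in config.items() if name != faulty}
    updated[fresh] = True
    return updated


if __name__ == "__main__":
    config = {"a": True, "b": True, "c": False}  # replica "c" has failed
    print("progress possible:", can_make_progress(len(config), 2))
    print("margin before replacement:", failure_margin(config))
    config = replace_faulty(config, faulty="c", fresh="d")
    print("margin after replacement:", failure_margin(config))
```

With three replicas and one failure, the service still runs but cannot survive another failure. Swapping the failed replica for a fresh one brings the margin back to one, which is why early failure handling is attractive.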
Similar resources
Performance Prediction of a Flexible Manufacturing System
This investigation presents a stochastic model for a flexible manufacturing system consisting of a flexible machine, a loading/unloading robot, and an automated pallet-handling device. We consider an unreliable flexible manufacturing cell (FMC) in which the machine and robot are subject to individual as well as common-cause random failures. The pallet-handling system is completely reliable. The pallet o...
A One-Stage Two-Machine Replacement Strategy Based on the Bayesian Inference Method
In this research, we consider an application of Bayesian inference to the machine replacement problem. The application concerns the time at which to replace two machines producing a specific product, each performing a special operation on the product, when manufacturing defects arise because of failures. A common practice for this kind of problem is to fit a single distribution to the co...
Modelling and Decision-making on Deteriorating Production Systems using Stochastic Dynamic Programming Approach
This study aimed at presenting a method for formulating optimal production, repair and replacement policies. The system was based on the production rate of defective parts and machine repairs and then was set up to optimize maintenance activities and related costs. The machine is either repaired or replaced. The machine is changed completely in the replacement process, but the productio...
Developing a cellular manufacturing model considering the alternative routes, tool assignment, and machine reliability
Cell formation (CF) is one of the most important steps in the design of a cellular manufacturing system (CMS); it involves grouping machines into cells and parts into separate families so that costs are minimized. The various aspects of the problem should be considered in a CF. Machine reliability and the tools assigned to the machines are the most important problems which have t...
Distributed Wikis on Structured Overlays
We present a transaction processing scheme for structured overlay networks and use it to develop a distributed Wiki application that is based on a relational data model. The Wiki supports rich metadata and additional indexes for navigation purposes. Ensuring consistency and durability requires handling of node failures. We mask such failures by providing high availability of nodes by constructi...